Abstract

Assuming that the loss function is convex in the prediction, we con- 
struct a prediction strategy universal for the class of Markov prediction 
strategies, not necessarily continuous. Allowing randomization, we remove 
the requirement of convexity. 

1 Introduction 

This paper belongs to the area of research known as universal prediction of
individual sequences (see [2] for a review): the predictor's goal is to compete
with a wide benchmark class of prediction strategies. In the previous papers
[15] and [?] we constructed prediction strategies competitive with the important
classes of Markov and stationary, respectively, continuous prediction strategies.
In this paper we consider competing against possibly discontinuous strategies.
Our main results assert the existence of prediction strategies competitive with
the Markov strategies.

This paper's idea of transition from continuous to general benchmark classes
was motivated by Skorokhod's topology for the space D of "cadlag" functions,
most of which are discontinuous. Skorokhod's idea was to allow small
deformations not only along the vertical axis but also along the horizontal
axis when defining neighborhoods. Skorokhod's topology was metrized by
Kolmogorov so that it became a separable space ([1], Appendix III; [?], p. 913),
which allows us to apply one of the numerous algorithms for prediction with
expert advice (Kalnishkan and Vyugin's Weak Aggregating Algorithm in this paper)
to construct a universal algorithm.

In Section 2 we give the main definitions and state our main results, Theorems
1 and 2; their proofs are given in Sections 3 and 4, respectively.

2 Main results 

The game of prediction between two players, called Predictor and Reality, is 
played according to the following protocol (of perfect information, in the sense 
that either player can see the other player's moves made so far).

Prediction protocol 

FOR $n = 1, 2, \ldots$:
  Reality announces $x_n \in X$.
  Predictor announces $\gamma_n \in \Gamma$.
  Reality announces $y_n \in Y$.
END FOR.
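
The protocol can be simulated in a few lines. The following is only a minimal sketch under simplifying assumptions that are not in the paper: signals, predictions and observations are floats, Reality's moves are fixed in advance, and the names `play`, `follow_last` and `square_loss` are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Signal = float
Outcome = float
Prediction = float
History = List[Tuple[Signal, Outcome]]


def play(
    rounds: Iterable[Tuple[Signal, Outcome]],
    predictor: Callable[[History, Signal], Prediction],
    loss: Callable[[Prediction, Outcome], float],
) -> List[float]:
    """Run the prediction protocol and return Predictor's loss on each round."""
    history: History = []
    losses: List[float] = []
    for x, y in rounds:
        gamma = predictor(history, x)  # Predictor sees the past and the signal x_n
        losses.append(loss(gamma, y))  # only then is the observation y_n revealed
        history.append((x, y))
    return losses


def follow_last(history: History, x: Signal) -> Prediction:
    """Hypothetical strategy: repeat the previous observation (0.5 on round 1)."""
    return history[-1][1] if history else 0.5


def square_loss(gamma: Prediction, y: Outcome) -> float:
    """Illustrative loss function (an assumption, not prescribed by the paper)."""
    return (gamma - y) ** 2
```

For instance, `play([(0.1, 1.0), (0.2, 0.0)], follow_last, square_loss)` evaluates the hypothetical strategy on two rounds.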

The game proceeds in rounds numbered by the positive integers $n$. At the
beginning of each round $n = 1, 2, \ldots$ Predictor is given some signal $x_n$
relevant to predicting the following observation $y_n$. The signal is taken from
the signal space $X$ and the observation from the observation space $Y$.
Predictor then announces his prediction $\gamma_n$, taken from the prediction
space $\Gamma$, and the prediction's quality in light of the actual observation
is measured by a loss function $\lambda : \Gamma \times Y \to \mathbb{R}$.

We will always assume that the signal space $X$, the prediction space $\Gamma$,
and the observation space $Y$ are non-empty sets; $X$ and $\Gamma$ will often be
equipped with additional structures.

Markov-universal prediction strategies: deterministic case 

Predictor's strategies in the prediction protocol will be called prediction
strategies. Formally such a strategy is a function

$$D : \bigcup_{n=1}^{\infty} (X \times Y)^{n-1} \times X \to \Gamma;$$

it maps each history $(x_1, y_1, \ldots, x_n)$ to the chosen prediction. In this
paper we will be especially interested in Markov strategies, which are functions
$D : X \to \Gamma$; intuitively, $D(x_n)$ is the recommended prediction on round
$n$. The restriction to Markov strategies is not a severe one, since the signal
$x_n$ can encode as much of the past as we want (cf. footnote 1); in particular,
$x_n$ can contain information about the previous observations
$y_1, \ldots, y_{n-1}$. In this paper Markov prediction strategies will also be
called prediction rules (as in [15]; in a more general context, however, it
would be risky to omit "Markov" since "prediction rule" is too easy to confuse
with "prediction strategy").

For both our theorems we will need the notion of "approximation" to a signal
$x \in X$; intuitively, the "$m$-approximation" of $x$ is another signal
$\phi_m(x)$ which is as close to $x$ as possible but carries only $m$ bits of
information. If $X = [0, 1]$, a reasonable definition of $\phi_m(x)$ would be to
take the binary expansion of $x$ but remove all the binary digits starting from
the $(m + 1)$th after the binary dot. In general, we will have to equip $X$ with
an "approximation structure"; we will do this following Kolmogorov and
Tikhomirov ([?], Section 2, p. 913).

Consider a sequence of mappings $\phi_m : X \to X$, $m = 1, 2, \ldots$, such that
each $\phi_m$ is idempotent, in the sense $\phi_m(\phi_m(x)) = \phi_m(x)$ for all
$x \in X$, and $\phi_m(X)$ contains $2^m$ elements. (Such mappings are
coding-theory analogues of projections in linear algebra and contractions in
topology; $\phi_m(x)$ can be thought of as the result of encoding $x$, sending it
over an $m$-bit channel, and restoring $x$ as well as possible at the receiving
end.) It is the sequence $\phi = \{\phi_m \mid m = 1, 2, \ldots\}$ that will be
referred to as an approximation structure.
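
For the signal space $[0, 1]$ the binary-truncation idea can be sketched as follows. This is an illustration only: the function name `phi` is hypothetical, and mapping the endpoint 1 into the last dyadic cell is an assumption made here so that the range has exactly $2^m$ elements.

```python
import math


def phi(m: int, x: float) -> float:
    """m-bit approximation of x in [0, 1]: keep the first m binary digits."""
    cells = 2 ** m
    # Assumption: x = 1.0 is clamped into the last cell so that the
    # image of phi has exactly 2**m elements.
    index = min(math.floor(x * cells), cells - 1)
    return index / cells
```

Applying `phi(m, .)` twice gives the same result as applying it once, and every signal is moved by at most $2^{-m}$, so this family satisfies the idempotence requirement and converges uniformly.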

If $X$ is a totally bounded (say, compact) metric space, there is an
approximation structure $\phi$ such that

$$\lim_{m \to \infty} \rho(x, \phi_m(x)) = 0 \tag{1}$$

uniformly in $x \in X$. (We often let $\rho$ stand for the metric in various
metric spaces, always clear from the context.) In fact, the $m$th Kolmogorov
diameter

$$\mathcal{K}_m(X) := \inf_{\phi_m} \sup_{x \in X} \operatorname{diam}\left(\phi_m^{-1}(\phi_m(x))\right)$$

of $X$ is essentially the inverse function to the $\epsilon$-entropy
$\mathcal{H}_\epsilon(X)$. See [?] for precise values and estimates of
$\mathcal{K}_m(X)$ for numerous totally bounded metric spaces $X$.

A prediction strategy is Markov-universal for a loss function $\lambda$ and an
approximation structure $\phi$ if it guarantees that for any prediction rule $D$
and any $m = 1, 2, \ldots$ there exists a number $N_{D,m}$ such that for any
$N \ge N_{D,m}$ and any sequence $x_1, y_1, x_2, y_2, \ldots$ of Reality's moves
its responses $\gamma_n$ satisfy

$$\frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n) \le \frac{1}{N} \sum_{n=1}^{N} \lambda(D(\phi_m(x_n)), y_n) + 2^{-m}.$$

Theorem 1 Suppose $X$ is equipped with an approximation structure $\phi$,
$\Gamma$ is a closed convex subset of a separable Banach space, and the loss
function $\lambda(\gamma, y)$ is bounded, convex in the variable
$\gamma \in \Gamma$, and uniformly continuous in $\gamma \in \Gamma$ uniformly in
$y \in Y$. There exists a prediction strategy that is Markov-universal for
$\lambda$ and $\phi$.

A Markov-universal prediction strategy will be constructed in the next section.
Theorem 1 says that, under its conditions,

$$\limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^{N} \lambda(D(\phi_m(x_n)), y_n) \right) \le 0 \tag{2}$$

uniformly in $x_1, y_1, x_2, y_2, \ldots$ for all $m = 1, 2, \ldots$ and all
$D : X \to \Gamma$.

If $X$ is a compact metric space and (1) holds uniformly in $x \in X$, (2)
implies

$$\limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^{N} \lambda(D(x_n), y_n) \right) \le 0$$

for all continuous prediction rules $D$; this is close to Theorem 1 in [15]. The
advance of this paper as compared to [15] is that our main results do not assume
that $D$ is continuous.

Markov-universal prediction strategies: randomized case

When the loss function $\lambda(\gamma, y)$ is not required to be convex in
$\gamma$, the conclusion of Theorem 1 may become false ([?], Theorem 2). The
situation changes if we consider randomized prediction strategies.

A randomized prediction strategy is a function

$$D : \bigcup_{n=1}^{\infty} (X \times Y)^{n-1} \times X \to \mathcal{P}(\Gamma)$$

mapping the past to the probability measures on the prediction space. In other
words, this is a strategy for Predictor in the extended game of prediction with
the prediction space $\mathcal{P}(\Gamma)$. A Markov randomized prediction
strategy, or randomized prediction rule for brevity, is a function
$D : X \to \mathcal{P}(\Gamma)$.

We will say that a randomized prediction strategy outputting $\gamma_n$ is
Markov-universal for a loss function $\lambda$ and an approximation structure
$\phi$ if, for any randomized prediction rule $D$ and any $m = 1, 2, \ldots$,
there exists $N_{D,m}$ such that, for any sequence $x_1, y_1, x_2, y_2, \ldots$
of Reality's moves,

$$\sup_{N \ge N_{D,m}} \left( \frac{1}{N} \sum_{n=1}^{N} \lambda(g_n, y_n) - \frac{1}{N} \sum_{n=1}^{N} \lambda(d_n, y_n) \right) \le 2^{-m} \tag{3}$$

with probability at least $1 - 2^{-m}$, where $g_1, g_2, \ldots, d_1, d_2, \ldots$
are independent random variables distributed as

$$g_n \sim \gamma_n, \quad d_n \sim D(\phi_m(x_n)), \quad n = 1, 2, \ldots. \tag{4}$$

Intuitively, the word "probability" after (3) refers only to the prediction
strategies' internal randomization; it is not assumed that Reality behaves
stochastically. We will use this definition only in the case where the loss
function $\lambda$ is continuous in the prediction, and so (3) will indeed be an
event having a probability.

Theorem 2 Suppose the signal space $X$ is equipped with an approximation
structure $\phi$, $\Gamma$ is a separable topological space, and the loss
function $\lambda$ is bounded and such that the set of functions
$\{\lambda(\cdot, y) \mid y \in Y\}$ is equicontinuous. There exists a
randomized prediction strategy that is Markov-universal for $\lambda$ and
$\phi$.

A Markov-universal prediction strategy is constructed in Section 4. The
randomized version of (2), immediately following from Theorem 2, is

$$\limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^{N} \lambda(g_n, y_n) - \frac{1}{N} \sum_{n=1}^{N} \lambda(d_n, y_n) \right) \le 0 \quad \text{a.s.},$$

for all $m = 1, 2, \ldots$ and all $D : X \to \mathcal{P}(\Gamma)$, where
$g_1, g_2, \ldots, d_1, d_2, \ldots$ are independent and distributed as in (4).

3 Proof of Theorem 1



Let us fix a dense countable subset $\Gamma^*$ of $\Gamma$. We will say that a
function $D : X \to \Gamma$ is $m$-elementary if $D(X) \subseteq \Gamma^*$ and
$D(x)$ depends on $x$ only via $\phi_m(x)$; a function is elementary if it is
$m$-elementary for some $m$. There are countably many elementary functions; let
us enumerate them as $D_1, D_2, \ldots$. We will refer to these functions as
experts. We will apply a special case of Kalnishkan and Vyugin's Weak
Aggregating Algorithm (WAA) to the sequence of experts (as in [?]).

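Let $q_1, q_2, \ldots$ be a sequence of positive numbers summing to 1,
$\sum_{k=1}^{\infty} q_k = 1$. Define

$$l_n^{(k)} := \lambda(D_k(x_n), y_n), \qquad L_N^{(k)} := \sum_{n=1}^{N} l_n^{(k)}$$

to be the instantaneous loss of the $k$th expert $D_k$ on the $n$th round and his
cumulative loss over the first $N$ rounds. For all $n, k = 1, 2, \ldots$ define

$$w_n^{(k)} := q_k \beta_n^{L_{n-1}^{(k)}}, \qquad \beta_n := \exp\left(-\frac{1}{\sqrt{n}}\right)$$

($w_n^{(k)}$ are the weights of the experts to use on round $n$) and

$$p_n^{(k)} := \frac{w_n^{(k)}}{\sum_{k=1}^{\infty} w_n^{(k)}}$$

(the normalized weights; it is obvious that the denominator is positive and
finite). The WAA's prediction on round $n$ is

$$\gamma_n := \sum_{k=1}^{\infty} p_n^{(k)} D_k(x_n). \tag{5}$$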
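
One round of the weight computation and mixing can be sketched in code. The sketch assumes a finite pool of experts with scalar predictions (the paper works with a countable pool and Banach-space-valued predictions), and the names `waa_weights` and `waa_predict` are hypothetical.

```python
import math
from typing import List, Sequence


def waa_weights(q: Sequence[float], cum_losses: Sequence[float], n: int) -> List[float]:
    """Normalized weights p_n^(k) from prior weights q_k and losses L_{n-1}^(k)."""
    beta_n = math.exp(-1.0 / math.sqrt(n))  # learning rate used on round n
    w = [qk * beta_n ** Lk for qk, Lk in zip(q, cum_losses)]
    total = sum(w)
    return [wk / total for wk in w]


def waa_predict(
    q: Sequence[float],
    cum_losses: Sequence[float],
    n: int,
    expert_predictions: Sequence[float],
) -> float:
    """Weighted average of the experts' current predictions."""
    p = waa_weights(q, cum_losses, n)
    return sum(pk * g for pk, g in zip(p, expert_predictions))
```

On the first round all cumulative losses are zero, so the normalized weights coincide with the prior weights $q_k$.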

To make this series convergent, we may take qk :~ and reorder Dk so that 
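$\sup_x \|D_k(x)\| \le k$ for all $k$. In this case we will automatically have
$\gamma_n \in \Gamma$ since

$$\left\| \gamma_n - \sum_{k=1}^{K} \frac{p_n^{(k)}}{\sum_{j=1}^{K} p_n^{(j)}} D_k(x_n) \right\| = \left\| \sum_{k=1}^{K} \left( 1 - \frac{1}{\sum_{j=1}^{K} p_n^{(j)}} \right) p_n^{(k)} D_k(x_n) + \sum_{k=K+1}^{\infty} p_n^{(k)} D_k(x_n) \right\| \to 0 \tag{6}$$

as $K \to \infty$.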

Let $l_n := \lambda(\gamma_n, y_n)$ be the WAA's loss on round $n$ and
$L_N := \sum_{n=1}^{N} l_n$ its cumulative loss over the first $N$ rounds.

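Lemma 1 ([6], Lemma 9) The WAA guarantees that, for all $N = 1, 2, \ldots$,

$$L_N \le \sum_{n=1}^{N} \sum_{k=1}^{\infty} p_n^{(k)} l_n^{(k)} - \sum_{n=1}^{N} \log_{\beta_n} \sum_{k=1}^{\infty} p_n^{(k)} \beta_n^{l_n^{(k)}} + \log_{\beta_N} \sum_{k=1}^{\infty} q_k \beta_N^{L_N^{(k)}}. \tag{7}$$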

The first two terms on the right-hand side of (7) are sums over the first $N$
rounds of different kinds of mean of the experts' losses (see, e.g., [5],
Chapter III, for a general definition of the mean); we will see later that they
nearly cancel each other out. If those two terms are ignored, the remaining part
of (7) is identical (except that $\beta$ now depends on $n$) to the main
property of the "Aggregating Algorithm" (see, e.g., [?], Lemma 1). All infinite
series in (7) are trivially convergent.

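In the proof of Lemma 1 we will use the following property of "countable
convexity" of $\lambda$:

$$l_n \le \sum_{k=1}^{\infty} p_n^{(k)} l_n^{(k)}. \tag{8}$$

This property follows from (6) and

$$\lambda\left( \sum_{k=1}^{K} \frac{p_n^{(k)}}{\sum_{j=1}^{K} p_n^{(j)}} D_k(x_n), y_n \right) \le \sum_{k=1}^{K} \frac{p_n^{(k)}}{\sum_{j=1}^{K} p_n^{(j)}} \lambda(D_k(x_n), y_n)$$

if we let $K \to \infty$.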

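Proof of Lemma 1 The proof is by induction on $N$. For $N = 1$, (7) follows
from the countable convexity (8) and $p_1^{(k)} = q_k$. Assuming (7), we obtain

$$L_{N+1} = L_N + l_{N+1} \le L_N + \sum_{k=1}^{\infty} p_{N+1}^{(k)} l_{N+1}^{(k)} \le \sum_{n=1}^{N+1} \sum_{k=1}^{\infty} p_n^{(k)} l_n^{(k)} - \sum_{n=1}^{N} \log_{\beta_n} \sum_{k=1}^{\infty} p_n^{(k)} \beta_n^{l_n^{(k)}} + \log_{\beta_N} \sum_{k=1}^{\infty} q_k \beta_N^{L_N^{(k)}}$$

(the first "$\le$" again used the countable convexity (8)). Therefore, it
remains to prove

$$\log_{\beta_N} \sum_{k=1}^{\infty} q_k \beta_N^{L_N^{(k)}} \le - \log_{\beta_{N+1}} \sum_{k=1}^{\infty} p_{N+1}^{(k)} \beta_{N+1}^{l_{N+1}^{(k)}} + \log_{\beta_{N+1}} \sum_{k=1}^{\infty} q_k \beta_{N+1}^{L_{N+1}^{(k)}}.$$

By the definition of $p_n^{(k)}$ this can be rewritten as

$$\log_{\beta_N} \sum_{k=1}^{\infty} q_k \beta_N^{L_N^{(k)}} \le - \log_{\beta_{N+1}} \frac{\sum_{k=1}^{\infty} q_k \beta_{N+1}^{L_N^{(k)}} \beta_{N+1}^{l_{N+1}^{(k)}}}{\sum_{k=1}^{\infty} q_k \beta_{N+1}^{L_N^{(k)}}} + \log_{\beta_{N+1}} \sum_{k=1}^{\infty} q_k \beta_{N+1}^{L_{N+1}^{(k)}},$$

which after cancellation becomes

$$\log_{\beta_N} \sum_{k=1}^{\infty} q_k \beta_N^{L_N^{(k)}} \le \log_{\beta_{N+1}} \sum_{k=1}^{\infty} q_k \beta_{N+1}^{L_N^{(k)}}. \tag{9}$$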

The last inequality follows from the general result about comparison of
different means ([5], Theorem 85), but we can also check it directly (following
[6]). Let $\beta_{N+1} = \beta_N^a$, where $0 < a < 1$. Then (9) can be rewritten
as

$$\left( \sum_{k=1}^{\infty} q_k \beta_N^{L_N^{(k)}} \right)^a \ge \sum_{k=1}^{\infty} q_k \left( \beta_N^{L_N^{(k)}} \right)^a,$$

and the last inequality follows from the concavity of the function
$t \mapsto t^a$. ∎
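
Inequality (9) can also be sanity-checked numerically. The sketch below assumes $\beta_n = \exp(-1/\sqrt{n})$ and a finite pool of experts, and compares the two sides of (9) on randomly generated weights and cumulative losses; the function names are hypothetical and the check is no substitute for the proof.

```python
import math
import random
from typing import Sequence, Tuple


def log_base(beta: float, value: float) -> float:
    """Logarithm of value to the base beta (0 < beta < 1)."""
    return math.log(value) / math.log(beta)


def sides_of_9(q: Sequence[float], cum_losses: Sequence[float], N: int) -> Tuple[float, float]:
    """Left- and right-hand sides of inequality (9) for a finite pool."""
    beta_n = math.exp(-1.0 / math.sqrt(N))
    beta_next = math.exp(-1.0 / math.sqrt(N + 1))
    lhs = log_base(beta_n, sum(qk * beta_n ** Lk for qk, Lk in zip(q, cum_losses)))
    rhs = log_base(beta_next, sum(qk * beta_next ** Lk for qk, Lk in zip(q, cum_losses)))
    return lhs, rhs


def check_random_instances(trials: int = 200, seed: int = 0) -> bool:
    """Return True if (9) holds, up to rounding, on all random instances."""
    rng = random.Random(seed)
    for _ in range(trials):
        K = rng.randint(2, 6)
        raw = [rng.random() + 1e-3 for _ in range(K)]
        q = [r / sum(raw) for r in raw]  # positive weights summing to 1
        N = rng.randint(1, 50)
        cum_losses = [rng.uniform(-N, N) for _ in range(K)]
        lhs, rhs = sides_of_9(q, cum_losses, N)
        if lhs > rhs + 1e-9:  # small tolerance for floating-point error
            return False
    return True
```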

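Lemma 2 ([6], Lemma 5) Let $L$ be an upper bound on $|\lambda|$. The WAA
guarantees that, for all $N$ and $K$,

$$L_N \le L_N^{(K)} + \left( L^2 e^L + \ln \frac{1}{q_K} \right) \sqrt{N}. \tag{10}$$

Proof From (7), we obtain:

$$\begin{aligned}
L_N &\le \sum_{n=1}^{N} \sum_{k=1}^{\infty} p_n^{(k)} l_n^{(k)} + \sum_{n=1}^{N} \sqrt{n} \ln \sum_{k=1}^{\infty} p_n^{(k)} \exp\left( -\frac{l_n^{(k)}}{\sqrt{n}} \right) - \sqrt{N} \ln \sum_{k=1}^{\infty} q_k \exp\left( -\frac{L_N^{(k)}}{\sqrt{N}} \right) \\
&\le \sum_{n=1}^{N} \sum_{k=1}^{\infty} p_n^{(k)} l_n^{(k)} + \sum_{n=1}^{N} \sqrt{n} \left( \sum_{k=1}^{\infty} p_n^{(k)} \left( 1 - \frac{l_n^{(k)}}{\sqrt{n}} + \frac{(l_n^{(k)})^2}{2n} e^{|l_n^{(k)}|/\sqrt{n}} \right) - 1 \right) - \sqrt{N} \ln \left( q_K \exp\left( -\frac{L_N^{(K)}}{\sqrt{N}} \right) \right) \\
&= L_N^{(K)} + \frac{1}{2} \sum_{n=1}^{N} \frac{1}{\sqrt{n}} \sum_{k=1}^{\infty} p_n^{(k)} (l_n^{(k)})^2 e^{|l_n^{(k)}|/\sqrt{n}} + \sqrt{N} \ln \frac{1}{q_K} \\
&\le L_N^{(K)} + \frac{L^2 e^L}{2} \sum_{n=1}^{N} \frac{1}{\sqrt{n}} + \sqrt{N} \ln \frac{1}{q_K} \\
&\le L_N^{(K)} + \frac{L^2 e^L}{2} \int_0^N \frac{dt}{\sqrt{t}} + \sqrt{N} \ln \frac{1}{q_K} \\
&= L_N^{(K)} + L^2 e^L \sqrt{N} + \sqrt{N} \ln \frac{1}{q_K}
\end{aligned}$$

(in the second "$\le$" we used the inequalities
$e^t \le 1 + t + \frac{t^2}{2} e^{|t|}$ and $\ln t \le t - 1$). ∎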

Remark There is no term $e^L$ in [6] since that paper only considers
non-negative loss functions. (Notice that even without assuming non-negativity
this term is very crude and can be easily improved.)
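
The bound of Lemma 2 can be illustrated on a small simulation. This is a minimal sketch under assumptions that are not in the paper: a finite pool of constant experts, scalar predictions in $[0, 1]$, and the square loss (so that $L = 1$); `run_waa` and `lemma2_bound` are hypothetical names.

```python
import math
from typing import Callable, List, Sequence, Tuple


def run_waa(
    q: Sequence[float],
    expert_preds: Sequence[float],
    outcomes: Sequence[float],
    loss: Callable[[float, float], float],
) -> Tuple[float, List[float]]:
    """Return the WAA's cumulative loss and the experts' cumulative losses."""
    cum = [0.0] * len(q)
    total = 0.0
    for n, y in enumerate(outcomes, start=1):
        beta_n = math.exp(-1.0 / math.sqrt(n))
        w = [qk * beta_n ** Lk for qk, Lk in zip(q, cum)]
        s = sum(w)
        gamma = sum(wk / s * g for wk, g in zip(w, expert_preds))  # mixture prediction
        total += loss(gamma, y)
        cum = [Lk + loss(g, y) for Lk, g in zip(cum, expert_preds)]
    return total, cum


def lemma2_bound(cum_K: float, q_K: float, L: float, N: int) -> float:
    """Right-hand side of the Lemma 2 bound for expert K."""
    return cum_K + (L ** 2 * math.exp(L) + math.log(1.0 / q_K)) * math.sqrt(N)
```

With bounded convex losses the returned cumulative loss should not exceed `lemma2_bound` for any expert in the pool, which is what the lemma asserts.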

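Now it is easy to prove Theorem 1. The definition of Markov-universality can
be restated as follows: a prediction strategy outputting $\gamma_n$ is
Markov-universal if and only if for any prediction rule $D$, any
$m = 1, 2, \ldots$, and any $\epsilon > 0$ there exists $N_{D,m,\epsilon}$ such
that, for any $N \ge N_{D,m,\epsilon}$ and any $x_1, y_1, x_2, y_2, \ldots$,

$$\frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n) \le \frac{1}{N} \sum_{n=1}^{N} \lambda(D(\phi_m(x_n)), y_n) + \epsilon. \tag{11}$$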
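
Let $\gamma_n$ be output by the WAA and let us consider any prediction rule $D$,
any $m \in \{1, 2, \ldots\}$, and any $\epsilon > 0$. Choose $\delta > 0$ such
that $|\lambda(\gamma, y) - \lambda(\gamma', y)| < \epsilon/2$ whenever
$\rho(\gamma, \gamma') < \delta$ and choose an $m$-elementary expert $D_K$ such
that, for all $x \in \phi_m(X)$, $\rho(D(x), D_K(x)) < \delta$.
From (10) we obtain

$$\begin{aligned}
\frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n) &- \frac{1}{N} \sum_{n=1}^{N} \lambda(D(\phi_m(x_n)), y_n) \\
&\le \frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^{N} \lambda(D_K(\phi_m(x_n)), y_n) + \frac{\epsilon}{2} \\
&= \frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^{N} \lambda(D_K(x_n), y_n) + \frac{\epsilon}{2} \\
&\le \left( L^2 e^L + \ln \frac{1}{q_K} \right) \frac{1}{\sqrt{N}} + \frac{\epsilon}{2};
\end{aligned} \tag{12}$$

now (11) is obvious.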

4 Proof of Theorem 2

A convenient pseudo-metric on $\Gamma$ can be defined by

$$\rho(g, g') := \sup_{y \in Y} |\lambda(g, y) - \lambda(g', y)|, \quad g, g' \in \Gamma$$

(cf. [3], Corollary 11.3.4). Let us redefine $\Gamma$ as the quotient space
obtained from the original $\Gamma$ by identifying $g$ and $g'$ for which
$\rho(g, g') = 0$ ([4], Section 2.4); in other words, we will not distinguish
predictions that always lead to identical losses. Now $\rho$ becomes a metric on
$\Gamma$. Let $\Gamma^*$ be a countable dense subset of the original topological
space $\Gamma$ (which is separable as a subset of a separable Banach space); the
condition of equicontinuity implies that $\Gamma^*$ (formally defined as the set
of equivalence classes containing elements of the original $\Gamma^*$) remains a
dense subset in $\Gamma$ equipped with the metric $\rho$.
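We define the norm of a function $f : \Gamma \to \mathbb{R}$ as

$$\|f\|_{BL} := \sup_{g \ne g'} \frac{|f(g) - f(g')|}{\rho(g, g')} + \sup_{g} |f(g)|;$$

this norm is finite for bounded Lipschitz functions (which form a Banach space
under this norm: see [3], Section 11.2). Notice that

$$\|\lambda\|_{BL} := \sup_{y \in Y} \|\lambda(\cdot, y)\|_{BL} < \infty. \tag{13}$$

Next define

$$\lambda(\gamma, y) := \int_{\Gamma} \lambda(g, y) \, \gamma(dg), \tag{14}$$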
where $\gamma$ is a probability measure on $\Gamma$. This is the loss function in
a new game of prediction with the prediction space $\mathcal{P}(\Gamma)$; it is
linear and, therefore, convex in $\gamma$. (In general, the role of
randomization in this paper is to make the loss function convex in the
prediction.)
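
The linearity of the extended loss in the measure can be seen concretely for finitely supported measures. The sketch below is illustrative only: it represents a measure as a dictionary from predictions to probabilities and uses the 0-1 loss, which is not convex in the prediction; all names are hypothetical.

```python
from typing import Callable, Dict

Measure = Dict[float, float]  # prediction g -> probability assigned to g


def expected_loss(gamma: Measure, y: float, loss: Callable[[float, float], float]) -> float:
    """Extended loss: the integral of loss(g, y) with respect to gamma."""
    return sum(p * loss(g, y) for g, p in gamma.items())


def mixture(a: float, gamma1: Measure, gamma2: Measure) -> Measure:
    """Convex combination a * gamma1 + (1 - a) * gamma2 of two measures."""
    support = set(gamma1) | set(gamma2)
    return {g: a * gamma1.get(g, 0.0) + (1 - a) * gamma2.get(g, 0.0) for g in support}


def zero_one_loss(g: float, y: float) -> float:
    """A loss that is not convex in the prediction g."""
    return 0.0 if g == y else 1.0
```

The expected loss of `mixture(a, gamma1, gamma2)` equals the same convex combination of the two expected losses, whatever the shape of the underlying loss.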

As a metric on $\mathcal{P}(\Gamma)$ we will take the Fortet-Mourier metric
([3], Section 11.3) defined as

$$\beta(\gamma, \gamma') := \sup_{f : \|f\|_{BL} \le 1} \left| \int_{\Gamma} f \, d\gamma - \int_{\Gamma} f \, d\gamma' \right|.$$

The topology on $\mathcal{P}(\Gamma)$ induced by this metric is called the
topology of weak convergence ([1]; weak convergence is called simply
"convergence" in [3]; for the proof of equivalence of several natural
definitions of the topology of weak convergence, see [3], Theorem 11.3.3).

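Let us check that the loss function (14) is also bounded Lipschitz, in the
sense of (13): if $\gamma, \gamma' \in \mathcal{P}(\Gamma)$ and $y \in Y$,

$$|\lambda(\gamma, y) - \lambda(\gamma', y)| = \left| \int_{\Gamma} \lambda(g, y) \, \gamma(dg) - \int_{\Gamma} \lambda(g, y) \, \gamma'(dg) \right| \le \|\lambda\|_{BL} \, \beta(\gamma, \gamma').$$
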

It is easy to see that the space $\mathcal{P}(\Gamma)$ with metric $\beta$ is
separable: e.g., the set of probability measures concentrated on finite subsets
of $\Gamma^*$ and taking rational values is dense in $\mathcal{P}(\Gamma)$
(cf. [1], Appendix III). Let us enumerate the elements
of a dense countable set in 7'(r) as • • ■; as in the previous section, we 

will use the WAA to merge all experts $D_k$.

The convergence of the mixture (5) to a probability measure on $\Gamma$ is now
obvious. The countable convexity (8) now holds with equality and follows from
the general fact that

$$\int_{\Gamma} f \, d\left( \sum_{k=1}^{\infty} p_k P_k \right) = \sum_{k=1}^{\infty} p_k \int_{\Gamma} f \, dP_k$$

for bounded Borel $f : \Gamma \to \mathbb{R}$, positive $p_1, p_2, \ldots$
summing to 1, and $P_1, P_2, \ldots \in \mathcal{P}(\Gamma)$ (this is obviously
true for simple $f$ and follows for arbitrary integrable $f$ from the definition
of Lebesgue integral: see, e.g., [3], Section 4.1).

Therefore, it is easy to check that the chain (12) still works (with
$\mathcal{P}(\Gamma)$ equipped with metric $\beta$) and we can rephrase the
previous section's result as follows. For any randomized prediction rule $D$,
any $m = 1, 2, \ldots$, and any $\epsilon > 0$ there exists $N_{D,m,\epsilon}$
such that, for any $N \ge N_{D,m,\epsilon}$ and any
$x_1, y_1, x_2, y_2, \ldots$, the WAA's predictions
$\gamma_n \in \mathcal{P}(\Gamma)$ are guaranteed to satisfy

$$\frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n) \le \frac{1}{N} \sum_{n=1}^{N} \lambda(D(\phi_m(x_n)), y_n) + \epsilon \tag{15}$$

(cf. (11)).

The loss function is bounded in absolute value by a constant $L$, and so the
law of the iterated logarithm (in Kolmogorov's finitary form, [7], the end of
the introductory section; the condition that the cumulative variance tends to
infinity is easy to get rid of: see, e.g., [?], (5.8)) implies that for any
$\delta > 0$ there exists $N_\delta$ such that the conjunction of

$$\left| \sum_{n=1}^{N} \left( \lambda(g_n, y_n) - \lambda(\gamma_n, y_n) \right) \right| < \sqrt{2.01 L^2 N \ln \ln N} \quad \text{for all } N \ge N_\delta$$

and

$$\left| \sum_{n=1}^{N} \left( \lambda(d_n, y_n) - \lambda(D(\phi_m(x_n)), y_n) \right) \right| < \sqrt{2.01 L^2 N \ln \ln N} \quad \text{for all } N \ge N_\delta$$

holds with probability at least $1 - \delta$. Combining the last two
inequalities with (15) we can see that for any randomized prediction rule $D$,
any $m = 1, 2, \ldots$, any $\epsilon > 0$, and any $\delta > 0$ there exists
$N_{D,m,\epsilon,\delta}$ such that, for any $x_1, y_1, x_2, y_2, \ldots$, the
WAA's responses $\gamma_n \in \mathcal{P}(\Gamma)$ to
$x_1, y_1, x_2, y_2, \ldots$ are guaranteed to satisfy

$$\sup_{N \ge N_{D,m,\epsilon,\delta}} \left( \frac{1}{N} \sum_{n=1}^{N} \lambda(g_n, y_n) - \frac{1}{N} \sum_{n=1}^{N} \lambda(d_n, y_n) \right) \le \epsilon$$

with probability at least $1 - \delta$. This is equivalent to the WAA (applied
to $D_1, D_2, \ldots$) being a Markov-universal randomized prediction strategy.



5 Conclusion 

An interesting theoretical problem is to state more explicit versions of
Theorems 1 and 2; for example, to give an explicit expression for $N_{D,m}$.

The field of lossy compression is now well developed, and it would be
interesting to apply our prediction algorithms (perhaps with the Weak
Aggregating Algorithm replaced by an algorithm based on, say, gradient descent
[?] or defensive forecasting [?]) to the approximation structures induced by
popular lossy compression algorithms.